EPITA 2022 IML lab03_classification_01-fashion_mnist v2022-03-15_180239 by G. Tochon & J. Chazalon

Creative Commons License This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

Lab 3: Classification and Model Selection

Overview

Goals

In this session you will practice and become more familiar with the following concepts:

Parts

This notebook is the only part of the lab session.

What you need to do

Make sure you read and understand everything, and complete all the required actions. Required actions are preceded by the following sign: Work

Dataset

We will use the FashionMNIST dataset, created by Zalando Research.

Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.

Content

Here's an example of how the data looks (each class takes three rows):

Labels

Each training and test example is assigned to one of the following classes:

| Label | Description |
|-------|-------------|
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |
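For convenience when displaying predictions later on, the table above can be captured as a plain mapping (the names `LABEL_NAMES` and `label_name` are just illustrative):

```python
# Mapping from numeric label to class name, taken from the table above
LABEL_NAMES = {
    0: "T-shirt/top", 1: "Trouser", 2: "Pullover", 3: "Dress", 4: "Coat",
    5: "Sandal", 6: "Shirt", 7: "Sneaker", 8: "Bag", 9: "Ankle boot",
}

def label_name(label):
    """Return the human-readable class name for a Fashion-MNIST label."""
    return LABEL_NAMES[int(label)]
```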

0. Setup

Imports

Data loading

We first download the dataset, or load it if we already have it...

If you are working on Windows, you will need to adapt those lines, or, alternatively, download the dataset directly from the official repo and put the files under the same directory.

...then we open it...

...and we plot a random selection of images.

work **Question:** Is the train set **balanced**?

TODO answer

Teacher:

Yes it is balanced (same number of elements for each class).
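One way to verify this, assuming the training labels are in a NumPy array (a sketch; the helper names are illustrative):

```python
import numpy as np

def class_counts(y):
    """Count the number of samples for each class label."""
    labels, counts = np.unique(y, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))

def is_balanced(y):
    """True if every class has the same number of samples."""
    return len(set(class_counts(y).values())) == 1

balanced = is_balanced(np.repeat(np.arange(10), 5))  # 5 samples per class
unbalanced = is_balanced(np.array([0, 0, 1]))        # 2 vs 1 samples
```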

work **Question:** Is the train set **sorted**?

TODO answer

TEACHER:

No, it is not sorted (by class). We must be careful when sampling data.

work **Question:** Does the train set look noisy (at first sight)? *Hint:* Display the first images for each class.

TODO answer

TEACHER:

Some classes may be hard to discriminate, but intrinsically they look quite consistent (at first sight). Some samples are hard to identify though.

1. Train a tee-shirt/top vs pull-over classifier (class 0 vs class 2)

In the first exercise we will leverage scikit-learn's great classifiers to quickly get a baseline.

To avoid premature complexity, we will first focus on the case of a binary classifier, i.e. a case where data can only be classified into two classes. We usually label those classes 0 and 1.

work Train a classifier to discriminate images from class 0 ("tee-shirt/top") and class 2 ("pull-over"). *Hints:*
- Use the `select_2_classes()` function below to generate a train and a test set.
- Use a `LinearSVC` classifier with default parameters (but custom seeding for reproducibility), unless you have some particular classifier you want to try. Note that most of the questions assume you will use the `LinearSVC` classifier.

Create the classifier
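A minimal sketch of the binary training on random stand-in data; the `select_2_classes` helper below only mimics the one provided in the notebook, whose exact behavior may differ:

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_2_classes(X, y, class_a, class_b):
    """Keep only two classes and relabel them 0/1 (stand-in for the provided helper)."""
    mask = (y == class_a) | (y == class_b)
    return X[mask], (y[mask] == class_b).astype(int)

# Illustrative data shaped like flattened 28x28 images, 3 samples per class
rng = np.random.default_rng(42)
X = rng.random((30, 784))
y = np.repeat(np.arange(10), 3)

X2, y2 = select_2_classes(X, y, 0, 2)
clf = LinearSVC(random_state=42)  # custom seeding for reproducibility
clf.fit(X2, y2)
```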

2. Evaluation

2.1. Qualitative evaluation / visual inspection

Now that your classifier is trained, you should first check the results visually for a qualitative control.

work **Using the `plot_some_results` and `plot_some_errors` functions provided below, display some predictions from the test set and control that they make sense.**

Do the errors make sense?

Can you say for sure whether it is a tee-shirt/top or a pull-over in each case?

2.2. Accuracy

We are now going to evaluate the performance of the classifier using its built-in `score()` method.

The `score()` method reports the accuracy:

\begin{equation} \large \mathrm{Accuracy} = \frac{\text{Correct predictions}}{\text{All predictions}} \end{equation}

In the next section we will provide a more precise definition.

work Implement the function `my_accuracy` below to get the same result.

We should get the same result as before here:
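One possible NumPy implementation (a sketch; the function name comes from the notebook):

```python
import numpy as np

def my_accuracy(y_true, y_pred):
    """Fraction of predictions matching the ground truth."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))
```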

2.3. Confusion Matrix

The accuracy is a very basic indicator which weights all errors equally. However, the cost of misclassifying A as B is not always the same as the cost of misclassifying B as A (think of a fraud detection system, for example).

The confusion matrix is the key to understanding the core indicators: accuracy, precision, recall, etc.

For a binary classifier, labels are either 1 (True) or 0 (False), and the confusion matrix is composed of only 4 elements:

Let us look at the confusion matrix of your classifier…

work What are the counts of true positives, true negatives, false positives and false negatives produced by your classifier?

TODO answer

The matrix is arranged in this way:

|  | Predicted True | Predicted False |
|--|----------------|-----------------|
| **Expected True** | TP | FN |
| **Expected False** | FP | TN |
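The four counts can be computed directly from binary arrays (a sketch; in practice `sklearn.metrics.confusion_matrix` does this for you):

```python
import numpy as np

def binary_confusion(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))    # expected True, predicted True
    fn = int(np.sum(y_true & ~y_pred))   # expected True, predicted False
    fp = int(np.sum(~y_true & y_pred))   # expected False, predicted True
    tn = int(np.sum(~y_true & ~y_pred))  # expected False, predicted False
    return tp, fn, fp, tn
```

Note that for labels (0, 1), `sklearn.metrics.confusion_matrix` returns `[[TN, FP], [FN, TP]]`: rows follow the label order, not the True-first layout of the table above.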

2.4. Precision, Recall, F-Score

Based on these terms, the accuracy has the following definition:

\begin{equation} \large \mathrm{Accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} \end{equation}

And precision and recall are:

\begin{equation} \large \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \end{equation}

\begin{equation} \large \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \end{equation}

work Compute the precision and recall of your previous classification results.

The F-score, finally, is the harmonic mean of precision and recall:

\begin{equation} \large F_1 = \frac{2}{\mathrm{recall^{-1}} + \mathrm{precision^{-1}}} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{\mathrm{TP}}{\mathrm{TP} + \frac12 (\mathrm{FP} + \mathrm{FN}) } \end{equation}
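These three definitions translate directly into code (a sketch from raw counts; `sklearn.metrics` provides `precision_score`, `recall_score` and `f1_score` for label arrays):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```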

2.5. PR and ROC curves

For most classifiers, it is possible to rank their predictions on the test set by probability, confidence, or some other form of score.

With scikit-learn, classifiers which can provide such information implement either:

In the case of `LinearSVC`, the `decision_function()` method returns the confidence score associated with each sample. The confidence score for a sample is proportional to the signed distance of that sample to the hyperplane.

Using either of these methods, it is possible to plot, for each possible threshold, the numbers of TP, FP, FN, TN… or any other indicator derived from them.

Some intuition

work Here is (below) the distribution of confidence scores produced by the decision function (assuming you used a LinearSVC) **for the train set**. **Where are the majority of the values?**

TODO some thoughts about this distribution…

It shows that the majority of values are in $[-10, 10]$ and pretty much centered around zero.

What is more interesting is to plot the distribution of such values according to the class the samples belong to.

work Based on the previous plot, assuming each error (FP or FN) has the same cost, what is roughly the optimal decision threshold (on the x-axis) which will minimize the total error?

TODO answer

In the previous plot, the intersection between the two distributions is the set of samples which will be wrongly classified in the training set.

The optimal threshold, if all errors have the same cost, will be at the intersection of the two distributions.

Precision and Recall

sklearn.metrics.precision_recall_curve provides a very nice way to compute all the values that precision and recall would take by selecting each of the possible thresholds.

Here we will look at the values from the training set (because we will try to calibrate the decision function in the next section).

We provide you with some extra visualization function which can help.
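A minimal usage sketch of `precision_recall_curve`, on illustrative labels and scores (not the notebook's data):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])            # illustrative ground truth
scores = np.array([0.1, 0.4, 0.35, 0.8])   # illustrative decision scores

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# precision and recall have one more entry than thresholds:
# the final point (precision=1, recall=0) is appended by convention
```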

work By default, the `predict()` method will assume that values which are $\geq 0$ are `True` and values $\lt 0$ are `False`. **Is $0$ the best possible threshold here to maximize the accuracy? To maximize $F_1$?**

TODO what about the calibration?

The calibration is bad. We can fix it in several ways, as we will see in the next two sections: post-correction and feature pre-processing.

Before looking at the calibration of the predictor in more detail, let us discover the PR and ROC curves.

Precision vs Recall curve

Using precision and recall values computed for each threshold, it is possible to plot precision(t) vs recall(t) for each t (threshold).

This gives an idea of the different operation modes our system could have based on the threshold we choose:

Receiver Operating Characteristic (ROC) curve

The ROC curve, finally, plots the true positive rate (aka sensitivity, recall, hit rate) vs the false positive rate (aka fall out, 1 minus specificity, etc.)

\begin{equation} \large \mathrm {TPR} ={\frac {\mathrm {TP} }{\mathrm {P} }}={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FN} }}=1-\mathrm {FNR} = \text{Recall} \end{equation}

\begin{equation} \large \mathrm {FPR} ={\frac {\mathrm {FP} }{\mathrm {N} }}={\frac {\mathrm {FP} }{\mathrm {FP} +\mathrm {TN} }}=1-\mathrm {TNR} \end{equation}

It illustrates a kind of signal vs noise compromise: what is the amount of false positive (e.g. background noise) you will have to face to increase the recall (e.g. voice signal over radio transmitter).
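A minimal sketch with `sklearn.metrics.roc_curve` and the area under the curve, on illustrative data:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])            # illustrative ground truth
scores = np.array([0.1, 0.4, 0.35, 0.8])   # illustrative decision scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)  # fraction of correctly ranked (pos, neg) pairs
```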

3. Calibration

Calibration should be done on a validation set, but we can illustrate the process here on the train set directly. As long as we do not calibrate on the test set, this cannot be very wrong…

Calibration (here) is about how to find the threshold which maximizes some metric.

Let us first compute the uncalibrated predictions, and compute their accuracy on the test set.

Now, let us compute a new threshold on the train set which will maximize the accuracy, and evaluate the quality of the new decision on the test set.

work Complete the code of the function `find_best_threshold_for_accuracy` below to compute the new decision with a new threshold, and analyse the results on the test set.
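One possible sketch of the function (the name comes from the notebook; the brute-force scan over score midpoints is an assumption):

```python
import numpy as np

def find_best_threshold_for_accuracy(scores, y_true):
    """Return the decision threshold (and its accuracy) maximizing accuracy.

    Candidate thresholds are the midpoints between consecutive distinct
    scores, plus one value below and one above all scores."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true)
    s = np.sort(np.unique(scores))
    candidates = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]))
    accuracies = [np.mean((scores >= t) == y_true) for t in candidates]
    best = int(np.argmax(accuracies))
    return candidates[best], accuracies[best]
```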
work Is the accuracy on the test set better?

TODO Answer

Quite an improvement indeed… And calibration is even more important for deep networks!

But we may be doing things the wrong way: we know our features are not normalized!

Let us try some…

4. Preprocessing

In this section we will add some feature pre-processing and learn how to use pipelines.

Pipelines are an elegant way to combine preprocessors (which have fit() and transform() methods) with predictors (which have fit() and predict() methods) into a general object.

We encourage you to use this integrated way of combining pre-processing and classification because it prevents you from forgetting to apply the same pre-processing to the test data; a very common mistake!

To create pipelines, we can either use the Pipeline constructor, or the factory function make_pipeline. Their signature is slightly different.
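Both constructions side by side, with the `StandardScaler` and `LinearSVC` used in this lab (a sketch; only the step naming differs):

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Pipeline: you choose each step's name explicitly...
pipe1 = Pipeline([("scaler", StandardScaler()),
                  ("svc", LinearSVC(random_state=42))])

# ...make_pipeline: names are derived from the lowercased class names
pipe2 = make_pipeline(StandardScaler(), LinearSVC(random_state=42))
```

The auto-generated names matter later, e.g. when addressing a step's parameters in a grid search (`linearsvc__C`).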

work Create a pipeline which combines a `StandardScaler` with the same `LinearSVC` as we used before; then train it on our binary case and evaluate its performance on our test set.
work **How does the performance compare to our previous version?** **Is the new predictor calibrated differently?**

TODO Answer

TEACHER

work What are the steps run by sklearn **during the training** in terms of calls to `fit()`, `transform()` and `predict()` for the preprocessor and for the classifier? What are the steps run by sklearn **during the testing/prediction** in terms of calls to `fit()`, `transform()` and `predict()` for the preprocessor and for the classifier?

TODO answer

Training process:

Prediction process:

5. Cross validation for better model selection

Here we want to get the best possible model without looking at the test set.

We will split the train set into several parts, training on all parts but one and evaluating on the held-out part, then repeating the operation with another part left out.

We will keep the best performing model trained with this process.

work Using [`sklearn.model_selection.StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) to separate the train set into train and **validation** subsets automatically, train several models and keep the best one based on its accuracy **on the validation set**. Finally, evaluate the performance of the best model **on the test set**. *Hint*: look at the example in the documentation of [`sklearn.model_selection.StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold).
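A sketch of the selection loop on illustrative data (`clone` gives a fresh, unfitted copy of the estimator for each fold):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# Illustrative data: 40 samples, 2 balanced classes
rng = np.random.default_rng(0)
X = rng.random((40, 10))
y = np.repeat([0, 1], 20)

base = LinearSVC(random_state=42)
best_model, best_score = None, -1.0
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    model = clone(base).fit(X[train_idx], y[train_idx])
    score = model.score(X[val_idx], y[val_idx])  # accuracy on the validation fold
    if score > best_score:
        best_model, best_score = model, score
```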
work Is our new model performing better? What would be good usages of cross validation? Do we really need a stratified splitter here? *Some recommended readings:*
- API: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
- Cross validation: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
- Hyper-parameter tuning: https://scikit-learn.org/stable/modules/grid_search.html#grid-search
- More about estimator's variance: https://scikit-learn.org/stable/modules/learning_curve.html#learning-curve

TODO answer

TEACHER

Model performance and variance reduction

Comparing to training without cross-validation, this is not a tremendous gain here, but it can help when models have a lot of variance.

This is also a way to generate several models which can be ensembled.

Usages of cross-validation/selection

6. Meta-parameter optimization

Let us now try to optimize the meta-parameters of our predictor.

We will continue to work on 2 classes for now because it may be slow on all classes.

work Complete the code below to recreate the pipeline you used previously and set the grid parameters so you explore different values for the `C` parameter of our `LinearSVC`.
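A sketch of the grid search over `C`, on illustrative data (in a pipeline grid, parameters are addressed as `<step-name>__<param>`):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Illustrative data: 60 samples, 2 balanced classes
rng = np.random.default_rng(0)
X = rng.random((60, 10))
y = np.repeat([0, 1], 30)

pipe = make_pipeline(StandardScaler(), LinearSVC(random_state=42))
param_grid = {"linearsvc__C": [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(pipe, param_grid, cv=3)  # refits the best model on all data
grid.fit(X, y)
best_C = grid.best_params_["linearsvc__C"]
```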

Pretty printing of the parameters

work Did we end up using a value for `C` different from the default one?

TODO Answer

TEACHER

Yes, it seems that a different value is better adapted to our data.

7. Multiclass version: all classes

In this section, we will train a multi-class classifier, and we will have a quick look at the multi-class classification strategies: OvO (one versus one) and OvR (one versus rest, aka one versus all).

work How many classifiers do we need to train for the 10 classes of our dataset with:
- OvO strategy?
- OvA strategy?

TODO answer

TEACHER

OvA: $10$

OvO: $(10 \times 10 - 10) / 2 = 45$ (all ordered pairs, minus the same-class pairs, divided by 2 because of symmetry)

work Using a `SGDClassifier` (which will have less trouble with all this data), train a prediction system on the complete training set, and evaluate its performance on the test set. Do not forget to display some results and errors to make sure they make sense, and plot the confusion matrix. Make sure your system's performance is way above the expected performance a random system would have!
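A sketch of the multi-class training on illustrative data (`SGDClassifier` trains one binary classifier per class, i.e. an OvA scheme, with a linear SVM loss by default):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

# Illustrative data: 10 classes, 10 samples each
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = np.repeat(np.arange(10), 10)

clf = SGDClassifier(random_state=42)
clf.fit(X, y)
pred = clf.predict(X)
cm = confusion_matrix(y, pred)  # rows = expected class, columns = predicted class
```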

TEACHER

A random system would get a 10% accuracy on this balanced dataset; so 82% is much better than random guess.

Bonus: Some classes are easy to discriminate

or "Fashion MNIST is like MNIST: it is too easy."

Train a boot vs trouser classifier (class 9 vs class 1)

Quickly, to see that some classes are really easy to discriminate in this problem.

Prepare train set

Train classifier

Create test set

Those classes seem pretty easy to discriminate…

No error on test set!

8. More classifiers!

Now you can try various classifiers, with appropriate evaluation. Some suggestions: k nearest neighbors, logistic regression, SVM with a non-linear kernel (RBF typically), random forest…

9. Meta-optim again, full problem, several classifiers, strat kfold, multiple seeding…

Using cross-validation try to find the best performing classifier with the best parameters from scikit-learn to solve the full problem (all classes).

Be ready for a night of computation. Do not burn your laptop; use a desktop computer or a server!